Abstract

Background: Small RNA sequencing is commonly used to identify novel miRNAs and to determine their expression
levels in plants. There are several miRNA identification tools for animals, such as miRDeep, miRDeep2 and miRDeep*.
miRDeep-P was developed to identify plant miRNAs using miRDeep's probabilistic model of miRNA biogenesis, but
it depends on several third-party tools and lacks a user-friendly interface. The objective of our miRPlant program is
to predict novel plant miRNAs with improved accuracy while providing a user-friendly interface.

Results: We have developed a user-friendly plant miRNA prediction tool called miRPlant. Using 16 plant
miRNA datasets from four different plant species, we show that miRPlant achieves at least a 10% improvement in
accuracy over miRDeep-P, the most popular plant miRNA prediction tool. Furthermore, miRPlant uses a
Graphical User Interface for data input and output, and identified miRNAs are shown with all RNAseq reads in a
hairpin diagram.

Conclusions: We have developed miRPlant, which extends miRDeep* to various plant species by adopting suitable
strategies for identifying hairpin excision regions and filtering hairpin structures in plants. miRPlant does not require any
third-party tools such as mapping or RNA secondary structure prediction tools. miRPlant is also the first plant
miRNA prediction tool that dynamically plots the miRNA hairpin structure with small RNA reads for identified novel miRNAs.
This feature enables biologists to visualize the novel pre-miRNA structure and the location of small RNA reads relative
to the hairpin. Moreover, miRPlant can be easily used by biologists with limited bioinformatics skills.
miRPlant and its manual are freely available at http://www.australianprostatecentre.org/research/software/mirplant
or http://sourceforge.net/projects/mirplant/.
Background

miRNAs are a class of non-coding endogenous small
RNAs that post-transcriptionally regulate target genes
[1]. miRDeep-P [2], which is based on the miRDeep [3]
algorithm, is one of the most commonly used
computational plant miRNA identification tools.

The most challenging problem in identifying novel
plant miRNAs is to find a suitable genomic region as a
miRNA precursor candidate (to test whether it forms a
hairpin), because the majority of precursor miRNAs in
plants are between 100 and 200 bp long [4], which is much
longer than those in animals. Approaches using a shorter
candidate precursor may result in false negatives if the
actual precursor is longer and more variable than the
predicted precursor region. Conversely, using a longer
candidate precursor region to test whether it forms a
hairpin structure may result in a non-complementary match
for the mature miRNA within the candidate precursor miRNA.
Thus, in miRPlant, after small RNA sequencing reads are
mapped to the genome, genomic regions around mapped reads
are extended by 200 bp to determine whether they form
hairpin structures. To ensure detection of short plant
miRNA precursors, we also scan 100 bp regions for hairpins.
This strategy can detect bona fide miRNAs that would
otherwise be missed if only the longer (200 bp) precursor
candidate length was used.
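
The two-length scan described above can be sketched as follows. This is only an illustration, not miRPlant's actual code: the function name `candidate_windows`, the half-open coordinate convention, and the choice to try the read at either end of the candidate region are assumptions made for the example. The text specifies only the 200 bp and 100 bp extension lengths.

```python
from typing import List, Tuple


def candidate_windows(read_start: int, read_end: int, chrom_len: int,
                      extensions: Tuple[int, ...] = (200, 100)) -> List[Tuple[int, int]]:
    """Return half-open (start, end) candidate precursor windows around a read.

    For each extension length the mapped read is tried once at the left end
    of the window (window extends downstream) and once at the right end
    (window extends upstream). Windows are clipped to the chromosome.
    """
    windows = []
    for ext in extensions:
        # Read at the left end of the candidate: extend to the right.
        windows.append((read_start, min(chrom_len, read_end + ext)))
        # Read at the right end of the candidate: extend to the left.
        windows.append((max(0, read_start - ext), read_end))
    return windows


if __name__ == "__main__":
    # A 21 nt read mapped at positions 1000..1021 on a 5000 bp sequence.
    for window in candidate_windows(1000, 1021, 5000):
        print(window)
```

Each window would then be passed to a secondary structure routine to test whether it folds into a hairpin; that step is not shown here.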
Figure 1 Output display of predicted miRNA. The read location and number of reads are shown relative to the precursor hairpin structure.
The red sequence represents the mature miRNA.



The strategy for determining the precursor region
differs between miRDeep-P and miRPlant. miRDeep-P
determines the precursor region based on the genomic
region covered by overlapping reads, while miRPlant
determines the precursor region based on the mature miRNA
region (the most highly expressed read). The latter
strategy can reduce the number of false negative results
[5,6], as it guarantees that the mature miRNA is located
at the end of one arm of the stem loop.

It is important that biologists with basic computer
skills can easily use RNAseq tools in order to broaden
research within this field. Thus, miRPlant was developed
in Java, a platform-independent programming language.
A Graphical User Interface (GUI) is provided through which
a complete analysis pipeline, starting from raw data, is
run in a few button clicks: raw reads (.fastq files) ->
mapping (.bam files) -> miRNA identification, expression
and secondary structure display -> mRNA target prediction.
To further improve the accessibility of miRPlant, the tool
does not require any third-party tools. miRPlant also has
a detailed but concise output display that can be exported
for publication in different file formats such as eps, pdf
and svg (Figure 1). miRPlant images are generated
dynamically.



Implementation 

miRPlant operations can be divided into the following 
stages: 

i. filter out reads whose length is outside the 10-23 bp
range, or whose read quality is below the
threshold set by the user.

ii. aggregate identical reads into one.

iii. map aggregated reads to the reference genome
without mismatches. miRPlant uses a Java
implementation of the bowtie [7] alignment algorithm.
BAM format is used to store mapped reads. Please note
that the attribute "XS" in the BAM file is used to
record the copy number of the read, as introduced by
miRDeep*.

iv. gather sequences in the reference genome flanking
the RNAseq read (precursor miRNA region) to
determine whether the genomic region forms a
hairpin structure using an RNA secondary structure
prediction algorithm [8].

v. use the miRDeep model to calculate a score for
each predicted miRNA to measure the strength of
the prediction. A higher score equates to a higher
probability that the predicted miRNA is true.
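
Stages i and ii can be illustrated with a short sketch. It is not taken from the miRPlant source: the function name `filter_and_collapse`, the Phred+33 quality encoding, and the rule that a read fails when any base falls below the quality threshold are all assumptions made for the example. The text states only that reads outside 10-23 bp or below a user-set quality criterion are removed and that identical reads are aggregated.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def filter_and_collapse(reads: Iterable[Tuple[str, str]],
                        min_len: int = 10, max_len: int = 23,
                        min_phred: int = 20,
                        phred_offset: int = 33) -> Dict[str, int]:
    """Filter (sequence, quality) pairs and collapse identical sequences.

    Returns a mapping from each surviving sequence to its copy number.
    """
    counts: Counter = Counter()
    for seq, qual in reads:
        # Stage i: drop reads outside the allowed length range.
        if not (min_len <= len(seq) <= max_len):
            continue
        # Stage i (assumed rule): drop reads with any low-quality base.
        if min(ord(c) - phred_offset for c in qual) < min_phred:
            continue
        # Stage ii: identical reads are aggregated into one entry.
        counts[seq] += 1
    return dict(counts)


if __name__ == "__main__":
    example = [
        ("ACGTACGTACGTACGTACGTA", "I" * 21),        # kept
        ("ACGTACGTACGTACGTACGTA", "I" * 21),        # kept, same sequence
        ("ACGT", "IIII"),                           # too short
        ("ACGTACGTACGTACGTACGTT", "I" * 20 + "#"),  # one low-quality base
    ]
    print(filter_and_collapse(example))
```

The resulting copy numbers correspond to the information that the text says is stored in the "XS" attribute after mapping.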



[Figure 2 screenshot: the miRPlant window, titled "miRPlant: An Integrated Tool for Identification of Plant miRNA from RNA Sequencing Data", showing a Parameters panel (Adapter, precursor length, min loop length, flank length, max inconRead Ratio, miR Length, min phred, max Multimap, min reads, min score), a "Mapping sequences" panel with a "fastq or Bam file" input field, and a "submit" button.]



Figure 2 Parameter settings for miRPlant. Adapter sequences need to be replaced as appropriate. Data processing by miRPlant depends on
the extension of the input file. Mapping and identification are performed if the input file extension is ".fastq" or ".fa". Only identification is
performed if the file extension is ".bam". Output ".result" files are shown after clicking "submit".
The miRPlant interface enables users to customize
parameters, since miRNA biogenesis may differ between
plant species [2] (Figure 2). The default precursor
miRNA length is set to 200 bp. Here the precursor length
represents the length between the mature miRNA and the
mature star miRNA; the two flanking sequences are
excluded. miRPlant generates six output files, similar to
miRDeep*. Since plant miRNA precursors are much longer
than those of animals, the distance between the mature
miRNA and the mature star miRNA may be very long, which
may result in the formation of an internal loop.
Therefore, miRPlant allows for internal loops. The default
minimum loop size (including the distance from the loop
ends to the mature or mature star miRNA) is 25 bp. In
predicting mature miRNA, miRPlant requires that fewer than
10% of reads (max inconRead Ratio option in the GUI) fall
outside the predicted mature miRNA and mature star miRNA
sequences. In miRDeep, RNAseq reads in the loop are
counted as being consistent, but plant miRNAs have very
long loops. Thus, we exclude reads located within the loop
region. The other parameters are the same as in
miRDeep*.
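
The read-consistency check described above can be sketched as below. The sketch makes several assumptions that the text does not settle: the function name `inconsistent_read_ratio` is hypothetical, a read counts as consistent only when it lies entirely within the mature or star interval, and loop reads are dropped from both the numerator and the denominator (one possible reading of "we exclude reads located within the loop region").

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in precursor coordinates


def _inside(read: Interval, region: Interval) -> bool:
    """True if the read lies entirely within the region."""
    return region[0] <= read[0] and read[1] <= region[1]


def inconsistent_read_ratio(reads: Iterable[Tuple[int, int, int]],
                            mature: Interval, star: Interval,
                            loop: Interval) -> float:
    """Fraction of read copies outside the mature and star intervals.

    Each read is (start, end, copy_number). Reads that fall entirely
    inside the loop are ignored (an assumption, see the lead-in).
    """
    consistent = 0
    inconsistent = 0
    for start, end, count in reads:
        read = (start, end)
        if _inside(read, loop):
            continue  # loop reads are excluded from the calculation
        if _inside(read, mature) or _inside(read, star):
            consistent += count
        else:
            inconsistent += count
    total = consistent + inconsistent
    return inconsistent / total if total else 0.0


if __name__ == "__main__":
    reads = [(0, 21, 90), (150, 171, 5), (80, 100, 20), (10, 31, 5)]
    ratio = inconsistent_read_ratio(reads, mature=(0, 21),
                                    star=(150, 171), loop=(21, 150))
    # A candidate would pass the default filter when ratio < 0.10.
    print(ratio, ratio < 0.10)
```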

Results and discussion 

miRPlant has been tested on two rice datasets [9]. Both
miRPlant and miRDeep-P employ the miRDeep score
calculation, but miRPlant performs better than
miRDeep-P (Table 1), largely because miRPlant uses a
flexible method to form the precursor candidates from the
genomic region surrounding RNAseq reads. We set a
minimum score of four when using miRPlant. A detailed
summary of results can be found in Additional file 1; the
RNAseq datasets have GEO accession numbers GSM278571 and
GSM278572.

To further confirm the advantages of miRPlant, we
have extended this analysis to three more species
(Arabidopsis thaliana, Medicago truncatula and Prunus
persica) comprising 16 small RNA sequencing datasets
(detailed information in Additional file 2). To compare
the two tools, we rank the predicted miRNAs in
descending order of score for each tool, and then take the
top 100 miRNAs from miRPlant and from miRDeep-P for
our comparison. We show that miRPlant consistently
outperforms miRDeep-P in all samples (Table 2,
Additional files 3 and 4).
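
The comparison procedure can be sketched as follows, using the precision and recall definitions given in the table footnotes (known miRNAs among the predictions divided by the number of predictions, and divided by the total number of known miRNAs, respectively). The function names and the toy identifiers are hypothetical and only serve the illustration.

```python
from typing import Iterable, List, Set, Tuple


def top_n(predictions: Iterable[Tuple[str, float]], n: int = 100) -> List[str]:
    """Rank (name, score) predictions by descending score and keep the top n."""
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def precision_recall(predicted: List[str], known: Set[str]) -> Tuple[float, float]:
    """Precision = known hits / predicted; recall = known hits / total known."""
    hits = sum(1 for name in predicted if name in known)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(known) if known else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy predictions with made-up names and scores.
    predictions = [("a", 5.0), ("b", 9.0), ("c", 1.0), ("d", 7.0)]
    known = {"b", "x", "y", "z"}
    best = top_n(predictions, n=2)
    print(best, precision_recall(best, known))
```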



Table 1 Comparison table

             Rice (GSM278571)                 Rice (GSM278572)
Tool         miRDP           miRPlant         miRDP           miRPlant
Precision    0.82 (31/38)    0.95 (36/38)     0.70 (44/63)    0.83 (52/63)
Recall       0.22 (31/144)   0.25 (36/144)    0.24 (44/181)   0.29 (52/181)

miRDP = miRDeep-P. Precision = known miR/predicted miR; Recall = known miR/total known miR.



Table 2 Comparison table (ATH, MTR, PPE)

             A. thaliana              M. truncatula            P. persica
             (Number of known         (Number of known         (Number of known
             miRNAs: 121)             miRNAs: 196)             miRNAs: 75)
Tool         miRDP      miRPlant      miRDP      miRPlant      miRDP      miRPlant
Precision    0.405      0.51          0.22       0.66          0.2        0.55
Recall       0.35       0.65          0.10       0.325         0.29       0.65

Precision = known miR/predicted miR; Recall = known miR/total known miR.



Conclusions 

miRPlant is modelled on miRDeep* [5] for use with
plant small RNA sequencing data. We have integrated
all third-party functionality, such as genomic mapping
and RNA secondary structure prediction [8], into a Java
library, which is seamlessly integrated into miRPlant.

Availability and requirements 

Project name: miRPlant. 

Project home page: http://www.australianprostatecentre.org/research/software/mirplant

Operating system(s): Windows, Linux, Mac OS.

Programming language: Java. 

Other requirements: JRE. 

License: GNU General Public License. 

Any restrictions to use by non-academics: None. 

Additional files 



Additional file 1: List of all identified miRNAs from two rice small
RNAseq datasets.

Additional file 2: Small RNA sequencing data details. 
Additional file 3: Detailed result of miRPlant prediction. 
Additional file 4: Detailed result of miRDeep-P prediction. 